{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install langchain-google-vertexai langchain_community langchain-google-community[docai] pypdf unstructured"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Ingesting documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LangChain Document Loaders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TextLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most basic document loader is the `TextLoader`. It takes the file path of a text file \n",
    "as input and loads its entire content into a single `Document` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': './text_file.txt'}, page_content='Some interesting document')]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "\n",
    "loader = TextLoader(\"./text_file.txt\")\n",
    "documents = loader.load()\n",
    "\n",
    "documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loader's output is a list containing a single Document, holding the text file's \n",
    "content without any transformations. The document's metadata includes the path to \n",
    "the original file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CSVLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can integrate structured data from various sources using the `CSVLoader` module. \n",
    "It parses Comma-Separated Value (CSV) files and transforms their tabular content into a \n",
    "list of `Document` instances.\n",
    "\n",
    "CSVLoader accepts the path to a CSV file and converts the structured data within into \n",
    "a set of documents. Each row of the CSV file is represented as an individual document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': './data.csv', 'row': 0}, page_content='product_id: A1\\nproduct_name: Product A1\\nprice: 50'),\n",
       " Document(metadata={'source': './data.csv', 'row': 1}, page_content='product_id: A2\\nproduct_name: Product A2\\nprice: 30'),\n",
       " Document(metadata={'source': './data.csv', 'row': 2}, page_content='product_id: A3\\nproduct_name: Product A3\\nprice: 40')]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_community.document_loaders import CSVLoader\n",
    "\n",
    "loader = CSVLoader(\"./data.csv\")\n",
    "documents = loader.load()\n",
    "\n",
    "documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The argument `csv_args` is a dictionary of arguments that you can pass directly to \n",
    "Python's built-in `csv.DictReader`. \n",
    "\n",
    "This is useful for customizing how the CSV file is parsed. \n",
    "In the code snippet above we passed the dictionary use `{\"delimiter\": \";\"}` to specify a \n",
    "semicolon as the delimiter instead of the default comma."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': 'Product A1', 'row': 0}, page_content='product_id: A1\\nproduct_name: Product A1\\nprice: 50'),\n",
       " Document(metadata={'source': 'Product A2', 'row': 1}, page_content='product_id: A2\\nproduct_name: Product A2\\nprice: 30'),\n",
       " Document(metadata={'source': 'Product A3', 'row': 2}, page_content='product_id: A3\\nproduct_name: Product A3\\nprice: 40')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "loader = CSVLoader(\n",
    "\"./data.csv\", \n",
    "csv_args={\"delimiter\": \",\"}, \n",
    "source_column=\"product_name\"\n",
    ") \n",
    "documents = loader.load()\n",
    "documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PDFLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To extract data from PDF files you can use the `PyPDFLoader` class. \n",
    "\n",
    "This loader extracts textual  content from PDF documents, converting it into a list of `Document` instances. \n",
    "\n",
    "Each page within the PDF is treated as an independent document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- Page 1 ---\n",
      "Provided proper attribution is provided, Google hereby grants permission to\n",
      "repr\n",
      "\n",
      "--- Page 2 ---\n",
      "1 Introduction\n",
      "Recurrent neural networks, long short-term memory [ 13] and gated\n",
      "\n",
      "--- Page 3 ---\n",
      "Figure 1: The Transformer - model architecture.\n",
      "The Transformer follows this ove\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "\n",
    "path_or_url = \"https://arxiv.org/pdf/1706.03762\"\n",
    "loader = PyPDFLoader(path_or_url)\n",
    "documents = loader.load()\n",
    "\n",
    "for i, document in enumerate(documents[:3], start=1):\n",
    "    print(f\"--- Page {i} ---\")\n",
    "    print(document.page_content[:80])\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GoogleDriveLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LangChain facilitates seamless integration with Google Drive through its  `GoogleDriveLoader´ class, enabling direct file loading. \n",
    "\n",
    "Prerequisites include enabling the Google Drive API, authorizing desktop app credentials, and installing the following libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following snippet allows the loading of all files in a google drive folders. Please remember to replace the folder_ids and the token path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "FOLDER_ID = \"my-folder-id\"\n",
    "TOKEN_PATH = \"/path/to/token.json\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_google_community import GoogleDriveLoader\n",
    "\n",
    "try:\n",
    "    loader = GoogleDriveLoader(\n",
    "        folder_id=FOLDER_ID,\n",
    "        token_path=TOKEN_PATH,\n",
    "        recursive=False,\n",
    "    )\n",
    "    \n",
    "    docs = loader.load()\n",
    "except Exception as ex: # Must provide valid credentials\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A specific list of files can be targeted using file_ids. These IDs are obtainable from individual file URLs, similar to folder IDs. \n",
    "Any file type can be loaded by specifying a base file loader through the `file_loader_cls` argument. \n",
    "\n",
    "The resulting documents from the load have the same structure as the ones loaded with the underlaying `DocumentLoader` calss. \n",
    "\n",
    "For instance, consider a list of XML files in Google Drive with known IDs. Loading them is achieved as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import (\n",
    "  UnstructuredXMLLoader\n",
    ")\n",
    "\n",
    "try:\n",
    "    loader = GoogleDriveLoader(\n",
    "        folder_id=FOLDER_ID,\n",
    "        token_path=TOKEN_PATH,\n",
    "        recursive=False,\n",
    "        file_ids=[\"file_id_1\", \"file_id_2\", \"file_id_3\"],\n",
    "        file_loader_cls=UnstructuredXMLLoader,\n",
    "        file_loader_kwargs={},\n",
    "        # Loader kwargs can be added using this argument\n",
    "    )\n",
    "    \n",
    "    docs = loader.load()\n",
    "except Exception as ex: # Must provide valid credentials\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### DirectoryLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To allow the ingestion of multiple documents stored within a directory structure, the `DirectoryLoader` module offers a convenient solution.\n",
    "\n",
    "It automates the process of  discovering and loading files with specific extensions, consolidating them into a list of `Document` instances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='Some interesting document 2'),\n",
       " Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='Some interesting document 1'),\n",
       " Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='Some interesting document 0')]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_community.document_loaders import DirectoryLoader\n",
    "\n",
    "loader = DirectoryLoader(\"./documents\", glob=\"*.txt\")\n",
    "\n",
    "documents = loader.load()\n",
    "\n",
    "documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can specify a custom loader class if required with the `loader_cls` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='Some interesting document 2'),\n",
       " Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='Some interesting document 1'),\n",
       " Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='Some interesting document 0')]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_community.document_loaders import TextLoader\n",
    "\n",
    "loader = DirectoryLoader(\n",
    "    \"./documents\", \n",
    "    glob=\"*.txt\", \n",
    "    loader_cls=TextLoader\n",
    ")\n",
    "\n",
    "loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GCSFileLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "if we work with Google Cloud Storage as our file storage of choice, we can use the class `GCSFileLoader` as shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_google_community import GCSFileLoader\n",
    "\n",
    "loader = GCSFileLoader(\n",
    "    project_name=\"my-gcp-project-id\",\n",
    "    bucket=\"my-bucket\",\n",
    "    blob=\"my-file.txt\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can customize the loader class making use of the `loader_func` argument.\n",
    "\n",
    "In this example, we define a `load_pdf` function that uses the PyPDFLoader specifically for PDF files. \n",
    "\n",
    "The `GCSFileLoader` handles the interaction with GCS, while `load_pdf` ensures the content is correctly parsed as a PDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import PyPDFLoader\n",
    "\n",
    "def load_pdf(file_path):\n",
    "    return PyPDFLoader(file_path)\n",
    "\n",
    "loader = GCSFileLoader(\n",
    "    project_name=\"my-gcp-project-id\",\n",
    "    bucket=\"my-bucket\",\n",
    "    blob=\"my-file.txt\",\n",
    "    loader_func = load_pdf\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To process multiple files within a GCS bucket, the `GCSDirectoryLoader` class provides a convenient solution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_google_community import GCSDirectoryLoader \n",
    "\n",
    "loader = GCSDirectoryLoader(\n",
    "  project_name=\"your-gcp-project-id\", \n",
    "  bucket=\"your-bucket-name\",\n",
    "  prefix=\"path/to/your/directory\"\n",
    ") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LangChain's `TextSplitter` interface helps to break down large documents into smaller chunks for processing. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=10, \n",
    "    chunk_overlap=3 \n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other text splitters include:\n",
    "\n",
    "- `CharacterTextSplitter`: Splits on a specific character (e.g., newline).\n",
    "  \n",
    "- `MarkdownTextSplitter`: Handles markdown formatting for structured documents.\n",
    "\n",
    "- `PythonCodeTextSplitter`: Preserves the structure of Python code snippets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To split a LangChain `Document` into smaller chunks after initializing the splitter class, \n",
    "you need to call the `split_documents` method. \n",
    "\n",
    "It takes a list of `Document` as the input and outputs a list of chunked `Document`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='Some'),\n",
       " Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='interesti'),\n",
       " Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='sting'),\n",
       " Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='document'),\n",
       " Document(metadata={'source': 'documents/text_file-2.txt'}, page_content='2'),\n",
       " Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='Some'),\n",
       " Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='interesti'),\n",
       " Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='sting'),\n",
       " Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='document'),\n",
       " Document(metadata={'source': 'documents/text_file-1.txt'}, page_content='1'),\n",
       " Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='Some'),\n",
       " Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='interesti'),\n",
       " Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='sting'),\n",
       " Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='document'),\n",
       " Document(metadata={'source': 'documents/text_file-0.txt'}, page_content='0')]"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_community.document_loaders import DirectoryLoader\n",
    "\n",
    "loader = DirectoryLoader(\"./documents\", glob=\"*.txt\")\n",
    "\n",
    "documents = loader.load()\n",
    "\n",
    "texts = text_splitter.split_documents(documents)\n",
    "\n",
    "texts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DocAIParser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LangChain integrates this DocumentAI's Google Cloud service through the class `DocAIParser`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_google_community import DocAIParser\n",
    "from langchain_core.document_loaders.blob_loaders import Blob\n",
    "\n",
    "LOCATION = \"us\"\n",
    "GCS_OUTPUT_PATH = \"gs://BUCKET_NAME/FOLDER_PATH\"\n",
    "PROCESSOR_NAME = \"projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID\"\n",
    "\n",
    "try:\n",
    "    parser = DocAIParser(\n",
    "      location=LOCATION, \n",
    "      processor_name=PROCESSOR_NAME, \n",
    "      gcs_output_path=GCS_OUTPUT_PATH\n",
    "    )\n",
    "except ValueError:  # Define correct processor name\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the class is instantiated, we can use the method `lazy_parse` method to execute the processor and get the documents from a `Blob` in Google Cloud storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"gs://my-bucket/my-folder/my-pdf.pdf\"\n",
    "\n",
    "blob = Blob(path=path)\n",
    "\n",
    "try:\n",
    "    docs = list(parser.lazy_parse(blob))\n",
    "except NameError: # If the parser was not defined\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PDF parsing happens asynchronously on Google Cloud, and it takes time. We wait for all `operations`to finish and then parase the results as a list of `Document` objects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    operations = parser.docai_parse([blob])\n",
    "    while parser.is_running(operations):\n",
    "        time.sleep(0.5)\n",
    "    results = parser.get_results(operations)\n",
    "    docs = list(parser.parse_from_results(results))\n",
    "\n",
    "except NameError: # If the parser was not defined\n",
    "    pass"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
